TD(0) Leads to Better Policies than Approximate Value Iteration

Author

  • Benjamin Van Roy
Abstract

We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.
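
As a concrete illustration of the setup described in the abstract, the following is a minimal Python/NumPy sketch of approximate value iteration with state aggregation on a small finite MDP. All names here (P, g, alpha, phi, weights, aggregated_value_iteration) are illustrative assumptions, not from the paper; the sketch assumes a discounted cost-minimization objective.

    import numpy as np

    def aggregated_value_iteration(P, g, alpha, phi, weights, n_iters=1000):
        """Approximate value iteration where the cost-to-go function is
        approximated by a constant over each partition of the state space.

        P       -- transition probabilities, shape (n_actions, n_states, n_states)
        g       -- one-step costs, shape (n_actions, n_states)
        alpha   -- discount factor in (0, 1)
        phi     -- partition map, phi[s] = index of the partition containing state s
        weights -- nonnegative projection weights over states (assumed positive
                   on every partition)
        """
        n_clusters = phi.max() + 1
        r = np.zeros(n_clusters)                    # one constant per partition
        for _ in range(n_iters):
            J = r[phi]                              # lift partition values to states
            TJ = (g + alpha * (P @ J)).min(axis=0)  # Bellman backup (T J)(s)
            # Weighted projection onto piecewise-constant functions: each
            # partition's value becomes the weight-averaged backed-up value.
            for c in range(n_clusters):
                mask = (phi == c)
                w = weights[mask]
                r[c] = np.dot(w, TJ[mask]) / w.sum()
        return r

In these terms, the paper's point is that choosing `weights` proportional to the invariant distribution of the policy that is greedy with respect to the resulting fixed point makes the fixed points coincide with those of TD(0), which is what yields the performance loss bounds described in the abstract.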

Similar Articles

Performance Loss Bounds for Approximate Value Iteration with State Aggregation

We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to using invariant distributions of appropriate policies ...

Non-Stationary Approximate Modified Policy Iteration

We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error ε at each iteration is known to lead to stationary policies that are at least 2γε/(1−γ)²-optimal. Variations of Value and Policy Iteration, that ...
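
Read as a performance guarantee, and assuming ε denotes a uniform (sup-norm) bound on the per-iteration approximation error and π_k the stationary policy produced at iteration k (neither symbol is defined in the truncated snippet above), the stated bound can be written as

\[
\limsup_{k \to \infty} \, \| v^* - v^{\pi_k} \|_\infty \;\le\; \frac{2\gamma}{(1-\gamma)^2}\,\varepsilon .
\]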

Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations

We show how options, a class of control structures encompassing primitive and temporally extended actions, can play a valuable role in planning in MDPs with continuous state-spaces. Analyzing the convergence rate of Approximate Value Iteration with options reveals that for pessimistic initial value function estimates, options can speed up convergence compared to planning with only primitive act...

Differential Training of Rollout Policies

We consider the approximate solution of stochastic optimal control problems using a neurodynamic programming/reinforcement learning methodology. We focus on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. We indicate that, in a stochastic environment, the popu...

Image Restoration with Two-Dimensional Adaptive Filter Algorithms

Two-dimensional (TD) adaptive filtering is a technique that can be applied to many image and signal processing applications. This paper extends one-dimensional adaptive filter algorithms to TD structures, establishing novel TD adaptive filters. Based on this extension, the TD variable step-size normalized least mean squares (TD-VSS-NLMS), the TD-VSS affine projection algorithms (...

Publication date: 2005